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Delay-Coordinates Embeddings as a Data Mining 
Tool for Denoising Speech Signals. 

D. Napoletani 1 , C.A. Berenstein 2 , T. Sauer 3a , D.C. Struppa 3b and D. Walnut 3c . 



Abstract — In this paper we utilize techniques from the theory 
of non-linear dynamical systems to define a notion of embedding 
threshold estimators. More specifically we use delay-coordinates 
embeddings of sets of coefficients of the measured signal (in 
some chosen frame) as a data mining tool to separate structures 
that are likely to be generated by signals belonging to some 
predetermined data set. We describe a particular variation of 
the embedding threshold estimator implemented in a windowed 
Fourier frame, and we apply it to speech signals heavily cor- 
rupted with the addition of several types of white noise. Our 
experimental work seems to suggest that, after training on the 
data sets of interest, these estimators perform well for a variety 
of white noise processes and noise intensity levels. The method 
is compared, for the case of Gaussian white noise, to a block 
thresholding estimator. 

Index Terms — Threshold estimators, delay-coordinates embed- 
dings, nonlinear systems, data-driven denoising. 



I. Introduction 

In this paper we explore the performance of a method of 
denoising that is designed to be efficient for a variety of white 
noise contaminations and noise intensities, while keeping a 
fixed choice of parameters of the algorithm itself (adapted to 
the class of signals to denoise). The method is based on a 
loose distinction between the geometry of delay-coordinates 
embeddings of, respectively, deterministic time series and 
non-deterministic ones. Delay-coordinates embeddings are 
the basis of many applications of the theory of non-linear 
dynamical systems, see for example [ASY] or [KS], our work 
stands apart from previous applications of embeddings in that 
no exact modelization of the underlyning signals (through the 
delay-coordinates embeddings) is needed nor attempted here. 
Instead, we measure the overall 'squeezing' of the dynamics 
along the principal direction of the embedding image by 
computing the quotient of the largest and smallest singular 
values. 

We define first of all the context in which we look for 
signal estimators. Let F[n], n = 1, N, be a discrete signal 
of length N, and let X[n] = F[n] + W[n], n = 1,...,N, 
be a contaminated measurement of F[n], where W[n] are 
realizations of a white noise process W, throughout this 
paper we use the notation E(*) to denote the expected value 
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of a quantity *. 

Generally we are interested in estimators F such that 
the expected mean square error E{\f — F\ 2 } is as small as 
possible. For a given discrete orthonormal basis B — {g m } 
of the N dimensional space of discrete signals, we can write: 
x = Yl,m=l x B[rn}g m where X B [m] =< X,g m > is the 
inner product of X and g m . Given such notation, we can 
define a class of estimators that is amenable to theoretical 
analysis, namely the class of diagonal estimators of the form 
F = Y^i=l d rn{X B [m})g m where d m (X B [m]) is a function 
that depends only on the value of Xb[to]. One particular 
kind of diagonal estimator is the hard thresholding estimator 
Ft (for T some positive real number) defined by the choice 



N—l 



F T = ^ d m {X B [m})g ri 



(1) 



m=0 



where 



and 



d m (X B [m}) = X B [m] if \X B [m]\ > T 



d m (X B [m]) = otherwise. 



If W[n] are realizations of a white normal distribution 
with variance a 2 , then it is shown in [DJ] that Ft, with 
T = <j\ / 2logN, achieves almost minimax risk (when 
implemented in a wavelet basis) for the class of signals 
f[n] of bounded variation. The possibility of proving such a 
striking result is based, in part, on the fact that the coefficients 
Ws[n] are realizations of a Gaussian white noise process in 
any basis B. 

Several techniques have been developed to deal with 
the non-Gaussian case, some of the most successful are the 
Efromovich-Pinsker (EP) estimator (see for example [ELPT] 
and references threin) and the block threshold estimators 
of Cai and collaborators (see [CS],[C] and the more recent 
[CL]). In these methods, the variance of the white process 
needs to be estimated from the data, moreover, since the 
threshold is designed to evaluate intensities (or relative 
intensities) of the coefficients in blocks of multiwavelets, 
low intensity details may be filtered out as it is the case for 
simpler denoising methods (see also remark 3 on the issue of 
low intensity non-noisy features). 

The method we describe in this paper does not need 
the knowledge of the noise intensity level (thanks to the use 
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of quotients of singular values), and it is remarkably robust 
to changes in the type of noise distribution. 
This strenght is achieved at a price, the inner parameters of 
the algorithm need to be adjusted to the data, this is true to 
some extent for the EP and block thresholding algorithms 
as well (see again [ELPT] and [CL]), but the number and 
type of parameters that need to be trained in our approach is 
increased by the need of choosing a 'good' delay-coordinates 
embedding suitable for the data we would like to denoise. 
In section V we will explore possible ways to make the 
training on the data automatic, but it is yet to be seen at 
this stage which data sets are amenable to the analysis 
we propose. This paper is meant as a mostly experimental 
analysis that suggest the method is sound at least for one 
choice of data sets (namely, speech signals). 

Because of the choice of applying our algorithm to a 
database of speech signals, we decided to use windowed 
Fourier frames as a basic analytical tool. This is an obvious 
way in which we are already adapting to the data, but more 
general frames V could be used, or even collection of frames 
and bases, therefore we prefer to label I? as a dictionary of 
analysis. 

Note that any discrete periodic signal X[n], n e /Z with 
period N can be represented in a discrete windowed Fourier 
frame. The atoms in this frame are of the form 

g m .A n ] = 9[n - m]exp( — ), n£l (2) 

We choose the window g to be a symmetric iV-periodic 
function of norm 1 and support q. Specifically we can choose 
g to be the characteristic function of the [0, 1] interval; we 
realize that this may not be the most robust choice in many 
cases, but we have deliberately selected this function to avoid 
excessive smoothing which was found to adversely affect our 
algorithm. 

Under the previous conditions x can be completely 
reconstructed from the inner products TX\m, I] =< 
X,g m .i >, i-e., 

^ JV-1AT-1 

X = jyEE H Ww (3) 

m=0 1=0 

where 

9m,l[ n } = g[n~ m]exp(— — ), nE% (4) 

We denote the collection {< X,g m j >} by TX. For finite 
discrete signals of length N the reconstruction has boundary 
errors. However, the region affected by such boundary effects 
is limited by the size q of the support of g and we can 
therefore have perfect reconstruction if we first extend X 
suitably at the boundaries of its support and then compute 
the inner products TX. More details can be found in [S] and 
references therein. 

Since for speech signals much of the structure in the 
time frequency domain is contained in localized 'ridges' 



that are oriented in time direction, the collection C p of 
double-indexed paths 

lm,i = {dm,l suc h that I = T,fh < m < fa + p}, (5) 

where p is some positive integer, will be relatively sensitive 
to local time changes of such ridges, since each path is a 
short line in the time frequency domain oriented in the time 
direction. 

The choice of p is very important as different structure 
in speech signals (our main case study) is evident at different 
time scales. Let I = I(j rll f) = 1(TX 1 _ r ) be a function 
defined for each path j m j S C p . We define now a semi-local 
thresholding estimator in the window Fourier frame as 
follows: 

W-1JV-1 

f=-^EE d IiT (TX[m, l])g m!l (6) 

m=0 1=0 

where di, T {FX[m,l]) = FX[m,l] if I{TX lrnl ) > T for 
some 7 l7l j containing (m, I), and di t T(TX[m,l}) — if 
I{FX lfh r ) < T for all j m j containing (m, I). 

Note that this threshold estimator is build to mirror the 
diagonal estimators in (1), but that the 'semilocal' quality of 
F is evident from the fact that all coefficients in several TX 1 
are used to decide the action of the thresholding on each 
coefficient. This procedure is similar to block thresholding 
estimators, with the additional flexibility of choosing the 
index function /. We propose in the next section a novel 
use of embedding techniques from non-linear dynamical 
systems theory to choose a specific form for /. We find 
in this way a variance independent estimator that does not 
depend significantly on the probability distribution of the 
random variable W and such that we can adapt to the data in 
a flexible way. 



II. Delay-Coordinates Embedding Images of Time 
Series 

We first recall a fundamental result about reconstruction of 
the state space realization of a dynamical system from its time 
series measurements. Suppose S is a dynamical system, with 
state space M k and let h : M k — > M be a measurement, i.e., 
a continuous function of the state variables. Define moreover 
a function F of the state variables X as 

F(X) - [h(X),h(S- T (X)), h(S_ {d _ 1)T (X))} (7) 

where by S-j T (X) we denote the state of the system with 
initial condition X at jr time units earlier. 

We say that A C M k is an invariant set with respect 
to S if X e A implies S t (X) e A for all t. Then the 
following theorem is true (see [ASY], [SYC] and [KS]): 

Theorem: Let A be an m-dimensional submanifold of 
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JR k which is invariant under the dynamical system S. If 
d > 2m, then for generic measuring functions h and generic 
delays r, the function F defined in (7) is one-to-one on A. 

Keeping in mind that generally the most significant 
information about g is the knowledge of the attractive 
invariant subsets, we can say that delay maps allow to have 
a faithful description of the underlining finite dimensional 
dynamics, if any. The previous theorem can be extended to 
invariant sets A that are not topological manifolds; in that 
case more sophisticated notions of dimension are used (see 
[SYC]). 

Generally the identification of the 'best' r and d that 
allows for a faithful representation of the invariant subset 
is considered very important in practical applications (as 
discussed in depth in [KS]), as it allows to make transparent 
the properties of the invariant set itself, more particularly 
we want to deduce from the data itself the dimension m of 
the invariant set (if any) so that we can choose a d that is 
large enough for the theorem to apply. Moreover the size 
of t has to be large enough to resolve the image far from 
the diagonal, but small enough to avoid decorrelation of the 
delay coordinates point. 

We apply the structure of the embedding in such a way that 
the identification of the most suitable r and d is not so crucial 
, even though we will see that we do need to train such 
parameters on the available data, but in a much simpler and 
straightforward way. The technical reason for such robustness 
in the choice of parameters will be clarified later on, but 
essentially we use time delay embeddings as data mining 
tools rather than modelization tools as usually is the case. 

To understand how such data mining is possible, we 
start by applying the delay-coordinate procedure to the time 
series W[n], n = 1,...,N, for W an uncorrected random 
process; let the measuring function h be the identity function 
and assume from now on that r is an integer delay so that 
F(W[n}) = [W[n],W[n - r],...,W[n - (d - l)r]]. For 
any embedding dimension d, the state space will be filled 
according to a spherically symmetric probability distribution. 
Let now Z — {F(Z[n}), n = 1,...,N} be the embedding 
image in M d of a time series Z for any given time delay r. 
Then we have the following very simple, but fertile lemma 
that relates spherical distributions to their associated to 
principal directions 

Lemma 1: Let a\, ad be the variance of W along the 
first principal direction (of largest extent) and the last one 
(smallest) respectively. Then the expected value E{^} 
converges to 1 as N goes to infinity. 

Proof: Because W is a white noise process, each coordinate 
of F(W[n]) is a realization of a same random variable with 
some given probability density function g, therefore W is a 
realization of a multivariate random variable of dimension d 
and symmetric probability distribution. If the expected value 
of — = Q > 1, then a point at a distance from the origin of 
a 1 has a greater probability to lie along the principal direction 
associated to o\ contradicting the fact that the probability 



distribution of W was symmetric. 

Remark 1: Even when X is a pure white noise process, 
the windowed Fourier frame will enforce a certain degree of 
smoothness along each path 7 since consecutive points in 7 
are inner products of frame atoms with partially overlapping 
segments of X. So there will be some correlation in TX 1 
even when X is an uncorrected time series, therefore it is 
possible in general that /(.FXy) » 1 even when X is a 
white noise process. 

Remark 2: Similarly, the length p of 7 cannot be chosen 
very large in practice, while E(^l) converges to 1 for any 
uncorrected processes only asymptotically for very long time 
series and again for small length p we may have E(^) » 1. 

Even with the limitations explained in the previous two 
remarks, it is still meaningful to set I(X^) = I svd (X 1 ) = 
and therefore we define an embedding threshold estimator to 
be a semilocal estimator F (as in (2)) with the choice of index 
I = I svd t what we call an embedding index. The question 
is now to find a specific choice of T > 1, given a choice 
of (D,C p ,d,T), that allows to discriminate a given data set 
(speech signals in this paper) from white noise processes. 

We need therefore to study the value distribution of 
jsvd f or our S p ec jfi c choice of C p and V, and assuming X is 
either an uncorrected random process or a signal belonging 
to our class of speech signals. 

In the next section we explore numerically this issue 
for the windowed Fourier frames and the collection of paths 
C p in (5). 



III. Embedding Index of Speech Signals and 
Random Processes 

For a given times series X and choice of parameters 
(p, t, d) we can compute the collection of embedding indexes 
I svd (TX) = {I avd {TX^), 7 e C p }, Define now the index 
cumulative function as 

#{ 7 such that I^(TX,)>t} 
Qx (t) = ^ , (8) 

i.e. for a given t, Qx{t) is the fraction of paths that have 
index above t. 

A simple property of Qx will be crucial in the following 
discussion: 

Lemma 2: If X is a white noise process and X' = aX is 
another random process that component by component is a 
reseating of X by a positive number a, then the expected 
function Qx and Qx 1 are equal. 

Proof: Each set of embedding points generated by one specific 
path 7 is, coordinate by coordinate, a linear combination 



4 



of some set of points in the original time series. Therefore 
if X' = aX, TX' = aTX 1 , but the quotient of singular 
values of a set of points is not affected by rescaling of all 
coordinates, therefore the distributions of I svd (!FX) and 
I svd (J r X') are equal, but Qx' and Qx are defined in terms 
of I svd so they are equal as well. 

Remark 3: We see the use of embedding index as 
a possible generalization of methods like the coherent 
structures extraction of [M] section 10.5 (more details can 
be found in [DMA]), where it is explored the notion of 
correlation of a signal X of length N with a basis B, defined 
as 

c[x) ~ """m ' 

It turns out that in the limit N — > oo the correlation of any 
Gaussian white process converges to 

independently of the specific variance and therefore estimation 
of a signal X is performed by retaining a coefficient Xs[m] if 
> Cn- In this paper the embedding index determines 
the coherence of a coefficient with respect to a neighbourhood 
of the signal and it is independent of the variance of the noise 
process as well. 

Remark 4: As we said in section II, the choice of p 
in C p is very important in practice. The speech signals that 
we consider are sampled at a sampling frequency of about 
8100 pt/s, we choose supprt of the window q = 64 and length 
of the paths p = 2 8 , since these values seem to assure that 
each path will be significantly shorter than most stationary 
vocal emissions, a point to take into consideration when we 
gauge the relevance of our results. 

Given this lenght p for 7, we have some significant restrictions 
on the maximum embedding dimension d and time delay 
t that we can choose if we want to have for each path a 
sufficiently large number of points in the embedding image to 
be statistically significant, which we can obtain if p >> dr. 
Because of these restrictions we choose d = 4 and r = 4 
that give dr = 2 4 « p = 2 8 , we generate in this way 
240 points for each path. We heuristically tried to adjust the 
embedding parameters d and t and the lenght p of the paths 
so that the qualitative behaviour of speech signals and white 
noise processes was as distinct as possible, see the discussion 
in section IV for a possible way to make the choice of 
parameters automatic. 

We now expand some uncorrected zero mean random 
processes of length N = 2 11 on the windowed Fourier frame 
with the set values q = 64, p = 2 8 , d = 4 and r = 8. And 
we compute the embedding index Qx- 

The specific random processes we use here are time series 
with each point a realization of a random variables with: 



1) Gaussian probability density function. 

2) Uniform probability density function. 

3) Tukey probability density function, that is, a sum of two 
normal distributions with uneven weight (used in [ELPT] as 
well), each point of the time series is a realization of the 
random variable W = RN X + (1 - R)4N 2 /y/r + 16(1 - r), 
where N\ and N 2 are Gaussian random variables, and R 
is a Bernoulli random variable with P(R = 1) = 0.9 and 
r = P(R=l). 

4) discrete uniform pdf with values in {— Q,Q} for some 
positive Q. 

All probability density functions are set to have mean zero, 
and variance 1, since by Lemma 2 we know Q* will not be 
affected by changes of the variance. One of the pdf has heavy 
tail (Tukey pdf) and one of them is discrete (discrete uniform 
pdf). The kurtosis is respectively from pdf in 1) to pdf in 4): 
3, about 1.8, about 13, and about 1.2 

In Figure la we plot Qx(t) for the white noise processes 
generated with pdfs in l)-4), averaged over 10 repetitions for 
each random distribution. 

Remark 5: To speed up the computation, we sampled 
the indexes (in, I) of the paths in (5), more particularly we 
selected a sampling length of Sm = 1 for the frequency index 
m and a sampling length of Sj — p for the time index. 

Note that the qualitative behaviour of Qx is very similar for 
all chosen distributions, in particular they all exhibit a very 
fast decay for larger values of t. The maximum L 2 distance 
between any two Qx in the interval [0,40] is ss 0.54 (or 
some 6% of the average L 2 norm of the Qx) , we found that 
even for distribution with kurtosis up to 50 the maximum 
distance was less that 0.8 (about 8.5% of the average L 2 
norm of Qx), irrispective of the specific pdf, moreover most 
of the error is concentrated in regions of high intensity of the 
derivative and it does not affect much the behaviour of the 
right tail of the curves Qx- 

Therefore it seems that, for our choice of T> and C p , 
reasonably heavy tail distributions will not exibit a 
significantly different behaviour in Qx with respect 
to the Gaussian distribution, supporting our claim that Qx 
is robust with respect to the choice of white noise distribution. 

For each probability density function, the shape of Qx 
is affected by the correlation introduced by the length of q 
(the window support of the windowed Fourier Frame): if 
t < q some coordinates in each embedding point will be 
correlated and this will cause the decay of Qx to be slower 
when r is smaller. 

When Qx is computed (with the same choice of parameters) 
for a collection of 10 randomly selected segments of speech 
signals of length 2 11 , the rate of decay of the functions 
Qx is significantly different, and the tail of the functions 
is still considerably thick by the time the rate of decay of 
Qx for most random processes is almost zero (see Figure lb). 



5 




5 10 15 20 25 30 35 40 



Fig. 1. From top to bottom, this figure shows Q„, as defined in equation 
(7) for: a) uncorrected random processes l)to 4); b) ten randomly selected 
segment of speech signal from the TIMIT database. 



Since we want to have a significantly larger fraction of 
paths retained for speech signals rather than noise, we can 
select the threshold T in the following way: 

(A) Determination of Threshold Given a choice of 
parameters (V,C p ,P,T,d), a collection of training speech 
time series {Sj}, and a selection of white noise processes 
{Wi}, choose To to be the smallest t so that the mean of 
QSj(To) is one order of magnitute (10 times) larger than the 
mean of Q\Vi(To)- 

This heuristic rule gives, for the parameters in this section, 
T ~ 28.2. (A) gives us as experimental way to determine 
a threshold T — T for the index I svd that removes most 
of the time frequency structure of some predetermined 
noise distributions, while it preserves a larger fraction 
of the time frequency structure of speech signals. Since 
moreover 'reasonable' distributions exibited a Qx similar to 
the one of Gaussian distributions, we can in practice train 
the threshold only on Gaussian noise and be assured that 
it will be a meaningful value for a larger class of distributions. 

Note that even very low energy paths could have in principle 
high embedding index, still, the energy concentration in paths 
that have very high index tends to be large for speech signals, 
to see that, for a given signal X, let 

YM^XjU such that I^jTX,) > t} 

Ex{t) = ' (9) 

be the fraction of the total energy contained in paths with 
index above x. We can see in Figure 2 that the amount of 
energy contained in paths with high index value is significantly 
larger for speech signals than for noise distributions. 

More particularly, the fraction of the total energy of 
the paths carried by paths with I svd > To is on average 0.005 
for the noise distributions and 0.15 for the speech signals, or 
an increase by a factor of 30. 

It seems therefore that I svd 7 with our specific choice 
of parameters, is quite effective in separating a subset of 
paths that are likely to be generated by speech signals, note 
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Fig. 2. From top to bottom, this figure shows £*, as defined in equation 
(8) for: a) the uncorrected random processes in Figure la; b) the segments 
of speech signals in Figure lb. 



moreover that similar results can be obtained with local 
changes of p, t and d, which suggests an intrinsic robustness 
of the separation with respect of the parameters. 
This separation ability could be due, in principle, only to the 
very nice properties of speech signals. Note that if, for some 
!FX 7 , I svd = oo, then the state realization of the time series 
TX^ is embedded in a subspace of R d and therefore each 
point of TX 1 must be described as a linear function of the 
delay coordinates. This condition is very restrictive on the 
dynamics of TX 1 , but vocal emissions are locally periodic 
signals, and so they do fall, at least locally, into the class of 
linearly predictable discrete models, i.e., processes for which 
X k = r(X k _i, ...,X k _ d ) for some linear function r and for 
some integer d. 

The complexity of these linear models increases with 
increasing values of the embedding dimension d. But this 
is not fully satisfactory as we would like to be able to 
use the embedding index I svd to denoise more complex 
dynamics that cannot be described by simple linear predictive 
models. Moreover for small r we are measuring in many 
cases smoothness of the path and local correlation with the 
embedding index, yet, if we try to choose r as large as 
possible with still a clear separation of the training sets, 
we can see differences that are not accounted for by local 
correlation, indeed the embedding image is squeezed along 
the diagonal for paths with high local smoothness, but in 
principle for complex dynamics the principal direction could 
be oriented in any direction and therefore the embedding index 
is much more than simply a measure of local smoothness. 

There is a large literature on possible ways to distinguish 
complex dynamical systems from random behaviour (see for 
example the articles collected in [Me]), as we underlined 
in the previous section, much of this work stresses the 
identification of the proper embedding parameters r and d; 
the contribution of this paper to this ongoing discussion is the 
use of embedding techniques in the context of computational 
harmonic analysis. This context frees us from the need to 
use embedding techniques to find an effective modelization 
of the signals, such 'blind' use of the embedding theorem is, 
we believe, fertile from a practical point of view, as well as 
a theoretical one. 
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Note in any case that if the dimension of the invariant 
set A is d>A = 0, then for any white noise process W, X + W 
has spherically symmetric embedding image and ^ w 1 for 
any embedding dimension d as in the case of pure white 
noise. This means that an estimator based on I svd is not able 
to estimate noisy constant time series on a given path 7. 
This restriction can be eased by allowing information on the 
distance of the center of the embedding image to be included 
in the definition of the embedding threshold estimator. In this 
paper for simplicity we assumed cU > for all paths in C p . 
That seems to be sufficient in analyzing speech signals. 

IV. Attenuated Embedding Estimators 

In this section we develop an algorithm based on these 
ideas. The notion of semilocal estimator is slightly expanded 
to improve the actual performance of the estimator itself. To 
this extent, define tubular neighborhoods for each atom in the 
windowed Fourier frame, i.e.: 

0(g m ,i) = {g m >,i> s.t. \V -1\<1, \m' -m\< 1}, (10) 

Such neighborhoods are used in the algorithm as a way to 
make a decision on the value of the coefficients in a two 
dimensional neighborhood of TX 1 based on the the analysis 
of the one dimensional time series FX 1 itself. 

(CI) Set F = 0. 

(C2) Given X, choose q > and expand X in a windowed 
Fourier frame with window size q. 

(C3) Choose sampling intervals Sj for time coordinate 
and Sm for the frequency coordinate. Choose the path length 
p. Build a collection of paths C p as in (5). 

(C4) Choose embedding dimension d and delay r along 
the path. Compute the index I svd (J r X 1 _ i r ) for each 
^FX lfH - G C p . Use (A) to find the threshold level T. 

(C5) Choose attenuation coefficient a. Set FY[m,l] = 
aTX\m 1 V\ if I svd {TX 1 ) > T for some 7 containing 
9m',l'> 9m', V € 0(g m .i), otherwise set FY[m, I] = if 
pvd^px^ < T for all 7 containing g m '.y, g m ',v S 0(g m ,i)- 

(C6) Let Y be the inversion of FY. Set F = F + Y 

and X = X - Y. 

(C7) Choose a paramenter e > 0, if \Y\ > e go to 
step (C2). 

Note that the details of the implementation (C1)-(C7) 
are in line with the general strategy of matching pursuit. 
The window length q in step (C2) could change from one 
iteration to the next to 'extract' possible structure belonging 
to the underlining signal at several different scales. In the 
experiments performed in the following section we alternate 
between two window sizes qi and q 2 . 



The attenuation introduced in (C5) has some additional ad hoc 
parameters in the definition of the neighborhoods in (10) and 
in the choice of the attenuation parameter a. By the double 
process of increasing the number of nonzero coefficients 
chosen at each step and decreasing their contribution we 
are allowing more information to be taken at each iteration 
of the projection pursuit algorithm, but in a slow learning 
framework that in principle (and in practice as we found 
out) should increase the sharpness of the distinct features 
of the estimate, on the general issue of attenuated learning 
processes see the discussion in [HTF] chapter 10. Note that 
the attenuation coefficient leads to improved results only 
when it is part of a recursive algorithm, otherwise it gives 
only a rescaled version of the estimate. 

One drawback of the algorithm we described is the 
need to choose several parameters: we choose a dictionary of 
analysis T>, a collection of discrete paths C p , the embedding 
parameters r (time delay) and d (embedding dimension), and 
the learning parameters T (threshold level), a (attenuation 
coefficient) and e. Again we stress that all such choices 
are context dependent, and are the price to pay to have 
an estimator that is relatively intensity independent and 
applicable to wide classes of noise distributions. 
The choice of T> is dependent on the type of signals we 
analyze and we do not see a serious need to make such 
choice automatic. 

Since we analyze speech signals, we choose the dictionary 
to be the set of atoms of the windowed Fourier frames; the 
algorithm is not very sensitive to the choice of the length q 
of the window in the Fourier frame, while the use of several 
windows is found to be always beneficial. 
The choice of C p is also dependent on the type of signals 
analyzed, speech signals have specific frequencies that change 
in time, so a set of paths parallel to the time axis was natural 
in this case. Let us explore now the relation of parameters 
associated with C p , embedding parameters r and d and 
threshold T. Recall that for the collection C p we have as 
parameters the time and frequency sampling rates I and rh 
and the length p of the paths. The frequency sampling rates I 
and m are necessary only to speed up the algorithm, ideally 
we would like a dense sampling. Same considerations apply 
to the 'thickening' of the paths in (10), we basically try 
to speed up the algorithm by collecting more data at each 
iteration. 

So the only essential parameters are the path length p, the 
embedding parameters and the threshold T 
Essentially we want to set these parameters so that the 
number of paths that have index I svd > T is sizeable for 
a training set of speech signals and marginal for the white 
noise time series of interest. 

Our experience is that such choice is possible and robust, we 
gave a simple rule to find the threshold T in step (A) in the 
previous section given a choice of (p, r, d). 
A learning algorithm could be built to find T, the paths' 
length p, and the embedding parameters, namely let Qs(x) 
be the mean of the functions QsAx) for a training set of 
speech signals Si and Qw{x) be the mean of the functions 



7 



Q\Vi (x) for a set of white noise time sieries Wi 
We can first find d, t and p such that the distance of the 
functions Qw(x) and Qs(x) is maximum in the L 2 norm. 
After finding these parameters, we can find a value of T 
such that T is the smallest positive number with Qs(T) one 
order of magnitude larger than Qw(T), as we did in (A) in 
the previous section, to make our algorithm automatically 
applicable to data sets of interest different from speech signals 
it will be necessary to formalize this optimization procedure. 

Finally the choice of a and e is completely practical in 
nature, ideally we want a and e as close to zero as possible, 
but, to avoid making the algorithm unreasonably slow, 
we must set values that are found to give good quality 
reconstructions on some training set of speech signals while 
they require a number of iterations of the algorithm that is 
compatible with the computing and time requirements of 
the specific problem. For longer time series, as the ones 
in the next section, we segment the data in several shorter 
pieces, and we iterate the algorithm a fixed number of times 
k rather than using e in (C7) to decide the number of iterations. 

Note:The algorithm described in this paper is being patented, 
with provisional patent application number 60/562,534 filed 
on April 16, 2004. 

V. Denoising 

In this section we explore the quality of the attenuated 
embedding threshold as implemented in the windowed Fourier 
frame and with our class of paths C p . We apply the algorithm 
to 10 speech signals from the TIMIT database contaminated 
by different types of white noise with several intensity levels. 
We show that the attenuated embedding threshold estimator 
performs well for all white noise contaminations we consider. 
The delay along the paths is chosen as r = 4, the length of 
the paths is p = 2 8 and the window length of the windowed 
Fourier transform alternates between q = 100 and q = 25 
(to detect both features with good time localization and those 
with good frequency localization), the embedding dimension 
d = 4. For these parameters and for the set of speech signals 
that we used as training, we have T w 26.8 when q = 100 
and T ps 27.4 when q — 25 using the procedure (A) of section 
III. 

The sampling interval of the paths in the frequency direction 
is Sm = 3 and along the time direction is Si — p/2 We 
select a = 0.1, as small values of a seem to work best (see 
discussion in the previous section). The algorithm is applied to 
short consecutive speech segments to reduce the computational 
cost of computing the windowed Fourier transform on very 
long time series, therefore, to keep the running time uniformly 
constant for all such segments, we decided to iterate the 
algorithm (C1)-(C6) a fixed number of times (say 6 times) 
instead of choosing a parameter e in (C7). 
As we already said, the window size q in (C2) alternates 
between q = 100 and q = 25. It is moreover important 
to note that the attenuated embedding threshold is able to 
extract only a small fraction of the total energy of the signal 





Fig. 3. Scaled SNR gain in decibel of the attenuated embedding estimates 
plotted against the scaled SNR of the corresponding measurements. From top 
left in clockwise order we consider the case of: a)Gaussian white noise; b) 
uniform noise; c)Tukey white noise; d)discrete bimodal distribution . 



/, exactly because of the attenuation process, therefore the 
Signal-to-Noise Ratio (SNR) computations are done on scaled 
measurements X, estimates F, and signals F set to be all 
of norm 1. We call such estimations scaled SNR, and we 
explicitely write, for a given signal F and estimation Z, 



SNR S (Z) = lOlog 



1 



HI 



E(\F/\F\-Z/\Z\) 



We then compute SNR S (X) and SNR S (F) by 
approximating the expected values — 
and — F/\F\) with an average over several 

realizations for each white noise contamination. 

In Figure 3 we show the gains of the scales SNR of 
the reconstructions (with the attenuated embedding threshold 
estimator) plotted against the corresponding scaled SNR 
of the measurements. Each curve correspond to one of 10 
speech signals of approximately one second used to test 
the algorithm. From top left in clockwise direction we have 
measuremets contaminated by random processes with pdfs 
1) to 4) as defined in section III and with several choices of 
variance. Note that the overall shape of the scaled SNR gain is 
similar for all distributions (notwithstanding that the discrete 
plots do not have exactly the same domain). The maximum 
gain seems to happen for measurements with scaled SNR 
around 1 decibel. Note that the right tail of the SNR gains 
takes often negative values; this is due to the attenuation 
effect of the estimator that is pronunced for the high intensity 
speech features, but it is not necessarily indicative of worse 
perceptual quality with respect to the measurements, some of 
the figures in the following will clarify this point. 

In the first case of Gaussian white noise, we compared 
our algorithm to the block thresholding algorithm described 
in [CS], we used the matlab code implemented by [ABS], 
made available at www.jstatsoft.org/v06/i06/codes/ as a 
part of their thourogh comparison of denoising methods. As 
the block thresholding estimator is implemented in a symmlet 
wavelet basis that is not well adapted to the structure of 



Fig. 4. Signal 'SPEECH10' scaled to have norm 1. 



Fig. 7. Signal 'SPEECH5' scaled to have norm 1. 




Fig. 5. Noisy scaled measurement of SPEECH 10 with Gaussian white noise 
and scaled SNR of about ldb. 



speech signals, a more compelling comparison would require 
the development of an embedding threshold estimator in a 
wavelet basis, we plan to do so in a future work. In Figure 
10 we show the scaled SNR gain for all tested speech signals 
using the block threshold estimator (right plot) and attenuated 
embedding estimator (left plot). In Figure 4 we show one 
original speech signal, Figure 5 shows the measurement in 
the presence of Gaussian noise corresponding to the 'peak' 
of the SNR S gain curve (measurement SNR S sa 1), Figure 
6 shows the corresponding reconstruction with attenuated 
embedding threshold estimator. Similarly Figure 7 shows 
another speech signal, while Figure 8 shows the measurement 
with Tukey noise corresponding to the 'peak' of the Tukey 
noise SNR S gain curve (measurement SNR S w 1), Figure 
9 shows the reconstruction. In both cases the perceptual 
quality is better than the noisy measurements, which is not 
necessarily the case for estimators in general. 
Note moreover that even though T was found using only 
Gaussian white noise as the training distribution, none of the 
parameters of the algorithm were changed as we went from 
Gaussian white noise contaminations to more general white 
noise processes, and yet the SNR S gain was similar, it must 





Fig. 8. Noisy measurement of SPEECH5 with Tukey white noise and scaled 
SNR of about ldb. 

be noted though that the estimates for bimodal and uniform 
noise were not intelligible at the peak of the SNR S gain 
curve (just as the measurements were not). 

Since the performance of the embedding estimator is 
not well represented by the scaled SNR for low intensity 
noise (measurements appear to be better than the estimates), 
in Figures 10 to 21 we show two more instances of 
speech signals contaminated by lower variance Tukey noise, 
Gaussian noise and discrete bimodal noise (uniform noise 
leads to reconstructions very similar to the discrete bimodal 
distribution), for one case of low Gaussian white noise 
we show a block thresholding estimate, note how the low 
intensity details are lost, this inability to preserve low intensity 
details worsens when higher variance noise is added, but then 
again, it must be tempered by the fact that a standard wavelet 
basis is not well adapted to the structure of speech signals. 




Fig. 6. Attenuated embedding estimate of SPEECH 10 from the measurement 
in Figure 6, scaled to have norm 1. 



Fig. 9. Attenuated embedding estimate of SPEECH5 from the measurement 
in Figure 9, scaled to have norm 1. 
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Fig. 10. SNR g gain for the estimates of 10 speech signals and Gaussian 
additive noise using: the block thresholding estimator of [CS](right), the 
embedding threshold estimator(left). 




Fig. 11. Signal 'SPEECH2' scaled to have norm 1. 




Fig. 12. Noisy measurement of SPEECH2 with Tukey white noise and scaled 
SNR of about 4.4db. 




Fig. 13. Attenuated embedding estimate of SPEECH2 from the measurement 
in Figure 12, scaled to have norm 1, SNR S is £3 8.1db. 





Fig. 15. Attenuated embedding estimate of SPEECH2 from the measurement 
in Figure 14, scaled to have norm 1, SNR a is £3 8.1db. 




Fig. 16. Signal 'SPEECH7' scaled to have norm 1. 




Fig. 17. Noisy measurement of SPEECH7 with Tukey white noise and scaled 
SNR of about 7.3db. 



Fig. 14. Noisy measurement of SPEECH2 with bimodal white noise and 
scaled SNR of about 4.5db. 
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Fig. 18. Attenuated embedding estimate of SPEECH7 from the measurement 
in Figure 17, scaled to have norm 1, SNR S is PS 6. 




Fig. 19. Noisy measurement of SPEECH7 with Gaussian white noise and 
scaled SNR of about ll.ldb. 

Data files for the signal, measurement and reconstructions 
used to compute the quantities in all the figures are available 
upon request for direct evaluation of the perceptual quality. 

VI. Further Developments 

Given that the embedding threshold ideas were implemented 
with the specific goal of denoising speech signals, it may be 
worth emphasizing that in principle the construction of classes 
of paths can be applied to other dictionaries well adapted to 
other classes of signals, more paricularly, let T> = {gi, ...,gp} 
be a generic frame dictionary of P > N elements so that 
x = ELi^[ ra lfr' X v [m] =< X,g m >, where g m 
are dual frame vectors (see [M] ch.5). Given such a general 
representation for X, let C p = {71,..., 7q}, Q > P, be 
a collection of ordered subsets of T> of length p, that is, 
7i — {9ii>---)9i }> so that (J7j = T> and the cardinality 
of the set {7^ such that gj € 7^} is constant for every 
j = 0,...,P — 1 (this ensures that the discrete covering of 
the frame atoms is locally uniform). Note that C p needs not 
be the entire set of ordered subsets of V. We call each 7$ a 
'path' in T> for reasons that will be clear in the following. 
Let X li = {X-p[m] =< X,g m >, g m G 7^} be an ordered 




Fig. 20. Attenuated embedding estimate of SPEECH7 from the measurement 
in Figure 19, scaled to have norm \,SNR a is m 7.7. 




Fig. 21. Block thresholding estimate of SPEECH7 from the measurement in 
Figure 19, scaled to have norm \,SNR a is 7.6, note low intensity details 
are removed by the estimator. 

collection of coefficients of X in the dictionary T>. 

Then a a semi-local estimator in T> can be defined as: 

p-i 

F= di, T (X v [m])g m (11) 

m— 

where di T{Xv[m]) = Xv[m] if IiX^) > T for some 7 
containing m, and di,T{Xv\m\) — if I{X 1 ) < T for all 7 
containing m. 



The construction of significant sets of paths C p will 
depend from the application, we are currently exploring 
even the possibility of using random walks along the atoms 
of the dictionary V. In any case, after C p is selected, our 
specifc choice of index I svd can be used and the attenuated 
embedding estimator can certainly be applied and tested, 
soft threshold embedding estimators are an interesting open 
possibility as well. 
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